On Linking Heterogeneous Dataset Collections
Abstract
Link discovery is the problem of linking entities between two or more datasets, based on some (possibly unknown) specification. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n^2) comparisons by clustering entities into blocks, and limiting the evaluation of link specifications to entity pairs within blocks. Current link-discovery blocking methods explicitly assume that two RDF datasets are provided as input, and need to be linked. In this paper, we assume instead that two heterogeneous dataset collections, comprising arbitrary numbers of RDF and tabular datasets, are provided as input. We show that data model heterogeneity can be addressed by representing RDF datasets as property tables. We also propose an unsupervised technique called dataset mapping that maps datasets from one collection to the other, and is shown to be compatible with existing clustering methods. Dataset mapping is empirically evaluated on three real-world test collections ranging over government and constitutional domains, and shown to improve two established baselines.
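The two ideas in the abstract that lend themselves to a small illustration are the property-table representation of RDF data and the restriction of comparisons to entity pairs that share a block. The sketch below is only an illustration under stated assumptions: the token-based blocking key, the function names, and the sample data are all hypothetical and are not taken from the paper, and the dataset mapping technique itself is not shown.

```python
from collections import defaultdict
from itertools import product


def to_property_table(triples):
    """Pivot (subject, property, object) triples into one row per subject."""
    table = defaultdict(lambda: defaultdict(set))
    for subject, prop, obj in triples:
        table[subject][prop].add(obj)
    return {s: {p: sorted(v) for p, v in props.items()} for s, props in table.items()}


def token_blocks(table):
    """Assumed blocking scheme: each lowercase token of any value names a block.

    An entity can land in several blocks, so the mapping is one-to-many.
    """
    blocks = defaultdict(set)
    for entity, props in table.items():
        for values in props.values():
            for value in values:
                for token in str(value).lower().split():
                    blocks[token].add(entity)
    return blocks


def candidate_pairs(table_a, table_b):
    """Return only cross-dataset pairs that share at least one block."""
    blocks_a, blocks_b = token_blocks(table_a), token_blocks(table_b)
    pairs = set()
    for key in blocks_a.keys() & blocks_b.keys():
        pairs.update(product(blocks_a[key], blocks_b[key]))
    return pairs


# Hypothetical sample data: an RDF dataset as triples and a tabular dataset as rows.
triples = [
    ("ex:a1", "name", "Texas Constitution"),
    ("ex:a1", "year", "1876"),
    ("ex:a2", "name", "Ohio Budget"),
]
rdf_table = to_property_table(triples)
tabular = {
    "r1": {"title": ["Constitution of Texas"]},
    "r2": {"title": ["Florida Statutes"]},
}
pairs = candidate_pairs(rdf_table, tabular)
```

Once the RDF dataset is pivoted into a property table, both inputs have the same row-oriented shape, so a single blocking function can serve both. A link specification would then be evaluated only on `pairs` rather than on every combination of entities from the two datasets.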
Similar resources
A two-step blocking scheme learner for scalable link discovery
A two-step procedure for learning a link-discovery blocking scheme is presented. Link discovery is the problem of linking entities between two or more datasets. Identifying owl:sameAs links is an important, special case. A blocking scheme is a one-to-many mapping from entities to blocks. Blocking methods avoid O(n^2) comparisons by clustering entities into blocks, and limiting the evaluation of l...
Entity Linking to One Thousand Knowledge Bases
We address the task of entity linking to multiple knowledge bases (KB). In particular, we investigate the use of over one thousand domain-specific KBs derived from Wikia.com collections in conjunction with the Wikipedia collection as a background-knowledge repository. Our system employs a two-step approach: for each document, a supervised model with a large set of features detects whether there...
Exploration of Audiovisual Heritage Using Audio Indexing Technology
This paper discusses audio indexing tools that have been implemented for the disclosure of Dutch audiovisual cultural heritage collections. It explains the role of language models and their adaptation to historical settings and the adaptation of acoustic models for homogeneous audio collections. In addition to the benefits of cross-media linking, the requirements for successful tuning and impro...
SPE-174907-MS Rapid Data Integration and Analysis for Upstream Oil and Gas Applications
The increasingly large number of sensors and instruments in the oil and gas industry, along with novel means of communication in the enterprise has led to a corresponding increase in the volume of data that is recorded in various information repositories. The variety of information sources is also expanding: from traditional relational databases to time series data, social network communication...
Modularity based community detection in heterogeneous networks
Heterogeneous networks are networks consisting of different types of nodes and multiple types of edges linking such nodes. While community detection has been extensively developed as a useful technique for analyzing networks that contain only one type of nodes, very few community detection techniques have been developed for heterogeneous networks. In this paper, we propose a modularity based co...